winsorize
2022-03-15 · 3 min read
Source: https://www.wikiwand.com/en/Winsorizing
Cheap way to remove outliers from samples.
- TL;DR:
winsorize(X, q in [0,0.5])
replaces all elements inX
below quantileq
and above quantile1-q
with the quantileq
and1-q
respectively. - Just clamps outliers above and below a quantile.
- Effectively removes outliers without changing the number of samples (length of
X
). - Used in cargo's default bench harness when taking timing samples.
Example Implementation #
Source: https://github.com/phlip9/idris2-aoc21/blob/master/src/Bench.idr
import Data.List
import Data.List1
import Data.String
import Data.Vec
import Scratch
import System.Clock
import System.Random
%default total
||| Clamp `x` to the range `[min, max]`. Assumes `min <= max`.
clamp : Ord a => (min : a) -> (max : a) -> (x : a) -> a
clamp min max x = if x > max then max
else if x < min then min
else x
||| Linearly interpolate between `x0` and `x1` according to the scale parameter `t`.
||| For example, `lerp` returns `x0` when `t = 0` and `x1` when `t = 1`.
lerp : Num a => Neg a => (x0 : a) -> (x1 : a) -> (t : a) -> a
lerp x0 x1 t = (1 - t) * x0 + (t * x1)
||| Linearly interpolate between `x0` and `x1` then clamp in the range `[x0, x1]`.
||| Useful for handling scale parameters `t ∉ [0, 1]`
lerpClamp : Ord a => Num a => Neg a => (x0 : a) -> (x1 : a) -> (t : a) -> a
lerpClamp x0 x1 t = clamp x0 x1 (lerp x0 x1 t)
||| Truncate a float number returning the whole part. Truncate works
||| correctly on negative numbers, unlike `floor`.
||| For example, `trunc 3.58 == 3.0` and `trunc (-4.9) == (-4.0)`
trunc : F64 -> F64
trunc x = cast {to=F64} $ cast {to=Integer} x
||| Return the fractional part of a float number.
||| For example, `frac 3.58 == 0.58` and `fract (-4.9) == (-0.9)`
fract : F64 -> F64
fract x = x - trunc x
data Interval a
= OpenAbove a -- [a ..)
| OpenBelow a -- (.. a]
| Closed a a -- [a .. b]
||| Return the fractional index of a list as an `Interval` of the adjacent integer
||| number entries. A negative index will result in an `OpenBelow` interval while
||| an index greater than the largest integer index will result in an `OpenAbove`
||| interval.
fractIndex : (xs : List1 a) -> (idx : F64) -> Interval a
fractIndex (x ::: xs) idx = if idx < 0.0 then OpenBelow x
else go x xs (cast idx)
where
go : a -> List a -> Nat -> Interval a
go x [] idx = OpenAbove x
go x (y :: _) 0 = Closed x y
go _ (y :: ys) (S idx) = go y ys idx
||| Return the empirical quantile, interpolating between the two adjacent integer
||| number entries.
||| Assumes `xs` is sorted. Handles `q ∉ [0, 1]` by returning the min or max
||| respectively.
empiricalQuantile : (xs : List1 F64) -> (q : F64) -> F64
empiricalQuantile xs q
= let n = cast {to=F64} $ length xs
idx = (n * q) - 1
t = fract idx in
case fractIndex xs idx of
OpenAbove x => x
OpenBelow x => x
Closed x y => lerpClamp x y t
||| Replace elements below/above `q`/`1-q` quantile with the `q`/`1-q` quantile
||| respectively. Reduces impact of outliers without changing the number of samples.
|||
||| Note: assumes xs is sorted
||| See: https://en.wikipedia.org/wiki/Winsorising
|||
||| ### Example
|||
||| ```idris2
||| > winsorize [1.0,2,3,4,5,6,7,8,9,10] 0.25
||| [2.5 2.5, 3.0, 4.0, 5.0, 6.0, 7.0, 7.5, 7.5, 7.5]
||| ```
winsorize : (xs : List1 F64) -> (q : F64) -> List1 F64
winsorize xs q
= let q = clamp 0.0 1.0 q
q = if q > 0.5 then 1.0 - q else q
lo = empiricalQuantile xs q
hi = empiricalQuantile xs (1.0 - q) in
map (clamp lo hi) xs
winsorizeUnsorted : (xs : List1 F64) -> (q : F64) -> List1 F64
winsorizeUnsorted xs q = winsorize (sort xs) q